Skip to content

Conversation

@AviadCo
Copy link
Contributor

@AviadCo AviadCo commented Oct 27, 2024

  • During operation tiling using scf, we may avoid inserting affine.min to handle the last tile where upper_bound = step * k where stride may be a constant or a dynamic.

@llvmbot
Copy link
Member

llvmbot commented Oct 27, 2024

@llvm/pr-subscribers-mlir-scf
@llvm/pr-subscribers-mlir-linalg

@llvm/pr-subscribers-mlir

Author: Aviad Cohen (AviadCo)

Changes
  • During operation tiling using scf, we may avoid inserting affine.min to handle the last tile where upper_bound = step * k where stride may be a constant or a dynamic.

Full diff: https://github.com/llvm/llvm-project/pull/113819.diff

2 Files Affected:

  • (modified) mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp (+24)
  • (modified) mlir/test/Dialect/Linalg/transform-op-tile.mlir (+47)
diff --git a/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp b/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
index e2feb10b314540..54afbe4b032bd4 100644
--- a/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
+++ b/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
@@ -186,11 +186,35 @@ static void checkSafeToTileToForall(TilingInterface op,
   }
 }
 
+/// Returns true if `size` is dynamic multiplication of `stride`.
+/// i.e. , `size = stride * k` where stride may be a constant or a dynamic.
+static bool dynamiclyDivisible(OpFoldResult size, OpFoldResult stride) {
+  Value dynamicSize = dyn_cast_if_present<Value>(size);
+  if (!dynamicSize)
+    return false;
+  auto mulOp = dynamicSize.getDefiningOp<arith::MulIOp>();
+  if (!mulOp)
+    return false;
+  if (Value dynamicStride = dyn_cast_if_present<Value>(stride))
+    return mulOp.getLhs() == dynamicStride || mulOp.getRhs() == dynamicStride;
+  std::optional<int64_t> strideAsInt = getConstantIntValue(stride);
+  std::optional<int64_t> lhsAsInt = getConstantIntValue(mulOp.getLhs());
+  std::optional<int64_t> rhsAsInt = getConstantIntValue(mulOp.getRhs());
+  if (strideAsInt && lhsAsInt && *strideAsInt == *lhsAsInt)
+    return true;
+  if (strideAsInt && rhsAsInt && *strideAsInt == *rhsAsInt)
+    return true;
+  
+  return false;
+}
+
 /// Check if `stride` evenly divides the trip count `size - offset`.
 static bool tileDividesIterationDomain(Range loopRange) {
   std::optional<int64_t> offsetAsInt = getConstantIntValue(loopRange.offset);
   if (!offsetAsInt)
     return false;
+  if (*offsetAsInt == 0 && dynamiclyDivisible(loopRange.size, loopRange.stride))
+    return true;
   std::optional<int64_t> sizeAsInt = getConstantIntValue(loopRange.size);
   if (!sizeAsInt)
     return false;
diff --git a/mlir/test/Dialect/Linalg/transform-op-tile.mlir b/mlir/test/Dialect/Linalg/transform-op-tile.mlir
index 7bac850d0b7fe9..ade523ef378f36 100644
--- a/mlir/test/Dialect/Linalg/transform-op-tile.mlir
+++ b/mlir/test/Dialect/Linalg/transform-op-tile.mlir
@@ -266,3 +266,50 @@ func.func @tile_linalg_matmul(
     -> tensor<128x128xf32>
   return %0 : tensor<128x128xf32>
 }
+
+// -----
+
+#map = affine_map<(d0) -> (d0)>
+
+// CHECK-LABEL: splited_dynamic_linalg_generic
+func.func @splited_dynamic_linalg_generic(%arg0: tensor<?xi16>, %arg1: tensor<?xi16>) -> tensor<?xi16> {
+  %c80 = arith.constant 80 : index
+  %c0 = arith.constant 0 : index
+  %dim = tensor.dim %arg1, %c0 : tensor<?xi16>
+  %0 = tensor.empty(%dim) : tensor<?xi16>
+  %1 = arith.divui %dim, %c80 : index
+  %2 = arith.muli %1, %c80 : index
+  %3 = arith.remui %dim, %c80 : index
+  %extracted_slice = tensor.extract_slice %arg0[0] [%2] [1] : tensor<?xi16> to tensor<?xi16>
+  %extracted_slice_0 = tensor.extract_slice %arg1[0] [%2] [1] : tensor<?xi16> to tensor<?xi16>
+  %extracted_slice_1 = tensor.extract_slice %0[0] [%2] [1] : tensor<?xi16> to tensor<?xi16>
+  // CHECK: scf.for
+  // CHECK-NOT: affine.min
+  %4 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%extracted_slice, %extracted_slice_0 : tensor<?xi16>, tensor<?xi16>) outs(%extracted_slice_1 : tensor<?xi16>) {
+  ^bb0(%in_1: i16, %in_2: i16, %out: i16):
+    %6 = arith.addi %in_1, %in_2 : i16
+    linalg.yield %6 : i16
+  } -> tensor<?xi16>
+  %inserted_slice = tensor.insert_slice %4 into %0[%2] [%2] [1] : tensor<?xi16> into tensor<?xi16>
+  %extracted_slice_2 = tensor.extract_slice %arg0[%2] [%3] [1] : tensor<?xi16> to tensor<?xi16>
+  %extracted_slice_3 = tensor.extract_slice %arg1[%2] [%3] [1] : tensor<?xi16> to tensor<?xi16>
+  %extracted_slice_4 = tensor.extract_slice %0[%2] [%3] [1] : tensor<?xi16> to tensor<?xi16>
+  // CHECK-NOT: scf.for
+  %5 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%extracted_slice_2, %extracted_slice_3 : tensor<?xi16>, tensor<?xi16>) outs(%extracted_slice_4 : tensor<?xi16>) {
+  ^bb0(%in_1: i16, %in_2: i16, %out: i16):
+    %7 = arith.addi %in_1, %in_2 : i16
+    linalg.yield %7 : i16
+  } -> tensor<?xi16>
+  %inserted_slice_0 = tensor.insert_slice %5 into %inserted_slice[%2] [%3] [1] : tensor<?xi16> into tensor<?xi16>
+  return %inserted_slice_0 : tensor<?xi16>
+}
+
+
+module attributes {transform.with_named_sequence} {
+transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
+    %0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!transform.any_op) -> !transform.any_op
+    %const = transform.structured.match ops{["arith.constant"]} in %arg1 : (!transform.any_op) -> !transform.any_op
+    %1, %loop = transform.structured.tile_using_for %0 tile_sizes [%const] : (!transform.any_op, !transform.any_op) -> (!transform.any_op, !transform.any_op)
+    transform.yield
+}
+}

@github-actions
Copy link

github-actions bot commented Oct 27, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@AviadCo AviadCo force-pushed the tiling/avoid-min-on-dynamic branch from d88f749 to 33ba8d6 Compare October 27, 2024 18:49
Copy link
Contributor

@MaheshRavishankar MaheshRavishankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to this as a cleanup on `

}
}

/// Returns true if `size` is dynamic multiplication of `stride`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow. The tileDividesIterationDomain is checking for (ub - lb) % tilesize = 0. Why doesnt that catch the case you are trying to cover here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaheshRavishankar
you are right, my change adds dynamic support for (ub - lb) % tilesize where lb is zero, and ub is dynamicly defined by possibly_dynamic_ub = possibly_dynamic_tilesize * possibly_dynamic_stride.
Some background for this change - I am using split on dynamic operations (i.e. linalg::generic) where the I use arith.divui to calculate a tile which fits my cache and another part of arith.remui if the dynamic size won't be perfectly divided.
After the split, the first part size is defined by arith.muli where one of its operands is the tile size I am using to tile the operation - but the the ub is now considered dynamic although we can know the tile size divides it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see what you are trying. I am not a fan of this kind of piecemeal approaches. I tried something. I changed your input to use affine maps. Still didnt work. I get something like this

%1 = affine.apply affine_map<()[s0] -> ((s0 floordiv 80) * 80)>()[%dim]
%3 = scf.for %arg2 = %c0 to %1 step %c80 iter_args(%arg3 = %extracted_slice_1) -> (tensor<?xi16>) {
      %5 = affine.min affine_map<(d0)[s0] -> (-d0 + (s0 floordiv 80) * 80, 80)>(%arg2)[%dim]

IIRC adding --scf-for-loop-canonicalization should have fixed the problem, but it didnt

cc @matthias-springer I thought this was fixed at some time.

Anyway, I am OK with this patch overall given where things are, but this needs to account for the lower bound of the loop as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked into this pass for a long time. What I vaguely remember is that -scf-for-loop-canonicalization has always been a bit fragile due to limitations around semi-affine expressions in FlatAffineValueConstraints.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaheshRavishankar thanks for the replay!
I tried to take a look at scf-for-loop-canonicalization which I was unfamiliar with, as for now I didn't manage to find the issue why it does not manage to remove the affine.min in this case. I updated the MR to also handle none 0 offset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaheshRavishankar can you please take a look on the latest patch?

@AviadCo AviadCo force-pushed the tiling/avoid-min-on-dynamic branch from 33ba8d6 to d90fa69 Compare October 29, 2024 12:38
… sizes if possible

* During operation tiling using scf, we may avoid inserting affine.min
  to handle the last tile where `upper_bound = step * k` where stride may
be a constant or a dynamic.
@AviadCo AviadCo force-pushed the tiling/avoid-min-on-dynamic branch from d90fa69 to 7a5ab7c Compare October 29, 2024 12:42
Copy link
Contributor

@MaheshRavishankar MaheshRavishankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this seems to be doing much more than what the original PR set out to do. I am fine with it, but I'd really like this logic simpler. I cant gaurantee that as written this is maintainable.

if (!strideAsInt)
return false;
return ((sizeAsInt.value() - offsetAsInt.value()) % strideAsInt.value() == 0);
if (strideAsInt && offsetAsInt && sizeAsInt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please add { } around multi-line statements.

return;

// Given `ofr` = `x` * `y`, all dividers of `x` and `y` are dividers of `ofr`.
collectDividers(mulOp.getLhs(), dividers);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recursions are generally discouraged.

std::optional<int64_t> dividerAsInt = getConstantIntValue(divider);
return dividerAsInt && *dividerAsInt % *strideAsInt == 0;
};
return llvm::any_of(dividersOfSize, isStrideDividesDivider) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dont you think this logic is more complicated that you want it to be. I'd like to scope this a bit more narrowly. Just look for the immediate operations being a arith.mul and avoid the recursion.

The more general solution should be using something the IntegerDivisibilityAnalysis (here) that is very similar to the range inference analysis that can be used to fold this away. Doing this in transformations like this becomes unmaintainable in the long run.

@AviadCo
Copy link
Contributor Author

AviadCo commented Nov 14, 2024

@MaheshRavishankar per the discussion with you I decided to abandon the approach of searching patterns in arith operations as it may be too fragile like you mentioned. Instead I am avoiding the "split" before tiling and handling the affine.min accordingly.
I will also investigate IntegerDivisibilityAnalysis &scf-for-loop-canonicalization as those seems intresting for my flow, I appreciate your review!

@AviadCo AviadCo closed this Nov 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants